Our project is based on the Kaggle competition Google Analytics Customer Revenue Prediction, and the following is taken from the description of that competition.
The 80/20 rule has proven true for many businesses: only a small percentage of customers produce most of the revenue. As such, marketing teams are challenged to make appropriate investments in promotional strategies.
RStudio, the developer of free and open tools for R and enterprise-ready products for teams to scale and share work, has partnered with Google Cloud and Kaggle to demonstrate the business impact that thorough data analysis can have.
In this project, we will analyze a Google Merchandise Store (also known as GStore, where Google swag is sold) customer dataset to predict revenue per customer. Hopefully, the outcome will be more actionable operational changes and a better use of marketing budgets for those companies who choose to use data analysis on top of GA data.
The data can be downloaded from GitHub.
The original data comes from the Kaggle contest “Google Analytics Customer Revenue Prediction”, plus the hits part from the Google Analytics API (which is also included in the Kaggle version 2 data).
Raw Data
A brief description of the original data can be found on Kaggle:
fullVisitorId - A unique identifier for each user of the Google Merchandise Store.
channelGrouping - The channel via which the user came to the Store.
date - The date on which the user visited the Store.
device - The specifications for the device used to access the Store.
geoNetwork - This section contains information about the geography of the user.
socialEngagementType - Engagement type, either “Socially Engaged” or “Not Socially Engaged”.
totals - This section contains aggregate values across the session.
trafficSource - This section contains information about the Traffic Source from which the session originated.
visitId - An identifier for this session. This is part of the value usually stored as the _utmb cookie. This is only unique to the user. For a completely unique ID, you should use a combination of fullVisitorId and visitId.
visitNumber - The session number for this user. If this is the first session, then this is set to 1.
visitStartTime - The timestamp (expressed as POSIX time).
hits - This row and nested fields are populated for any and all types of hits. Provides a record of all page visits.
customDimensions - This section contains any user-level or session-level custom dimensions that are set for a session. This is a repeated field and has an entry for each dimension that is set.
totals - This set of columns mostly includes high-level aggregate data.
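As the visitId description above notes, visitId is unique only within a user; a minimal base-R sketch of building a fully unique session key (the first ID below appears in the data later in this report, the second is made up):

```r
# visitId alone is only unique per user, so combine it with
# fullVisitorId to get a globally unique session identifier.
sessions <- data.frame(
  fullVisitorId = c("8934116514970143966", "1234567890123456789"),
  visitId       = c(1508176307L, 1508201613L),
  stringsAsFactors = FALSE
)
sessions$sessionId <- paste(sessions$fullVisitorId, sessions$visitId, sep = "_")
sessions$sessionId
```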
The totals and hits fields are JSON objects, which we expand into regular columns. totals contains high-level aggregated data; revenue, for example, lives in totals.
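As a sketch of this expansion step using jsonlite (the field values here are made up for illustration, not taken from the data):

```r
library(jsonlite)

# Each entry of the totals column is a JSON object; parse it and
# flatten the result into ordinary data-frame columns.
totals_json <- '{"visits":"1","hits":"4","pageviews":"3","transactionRevenue":"16990000"}'
totals_df <- as.data.frame(fromJSON(totals_json), stringsAsFactors = FALSE)
totals_df$transactionRevenue <- as.numeric(totals_df$transactionRevenue)
totals_df
```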
hits records the visitor's detailed web actions: every page visited and every action taken. The transaction columns therefore live in hits, which means that using hits to predict revenue would let you fully recover the target values. We therefore remove all columns directly related to transactions.
There are still many other features remaining in hits:
time - Milliseconds of visiting time for each page
Product - Which product the visitor is currently viewing, including its price and name
page.* - The exact web path along the visitor's journey
social.* - Whether the visitor has a social network referral and where the referral came from
content* - What class of content the visitor is viewing
eventInfo* - The actions the visitor took during the visit
There are also some other technical columns that are not easy to interpret. In this version, the hits data contains multiple rows per visitor, so we need to derive features and summarise them. We do this through a detailed EDA of the hits data alone, finding patterns characteristic of buyers, and reduce the hits data according to those patterns. Using only these reduced features, and none of the original data, a random forest already improves the result to some extent.
The summary data looks like:
time_* - Quantiles of page-visit duration
*_count - Counts of particular actions or states during the whole visit. For example, social_count counts the visitor's referrals, Bags_count counts pages related to bags, and Click_count counts clicks on product pictures.
price_* - Quantiles of the prices of products the visitor took action on
just.view - A summary statistic telling you whether the visitor is just viewing and takes no actions
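A minimal base-R sketch of how such per-visitor summaries could be computed; the toy columns (time, price, click) and feature names below are illustrative, not the exact ones in our pipeline:

```r
# Toy hits-level data: one row per page hit, several rows per visitor.
hits <- data.frame(
  fullVisitorId = c("A", "A", "A", "B", "B"),
  time  = c(1200, 3400, 800, 500, 15000),   # ms spent on each page
  price = c(16.99, NA, 18.99, NA, 44.99),   # price of product acted on, if any
  click = c(1, 0, 1, 0, 0)                  # product-picture clicks
)

# Collapse to one row per visitor: a duration quantile, an action count,
# a price quantile, and a just.view flag for visitors with no actions.
summarise_visitor <- function(d) {
  data.frame(
    fullVisitorId = d$fullVisitorId[1],
    time_q50      = unname(quantile(d$time, 0.5)),
    click_count   = sum(d$click),
    price_q50     = unname(quantile(d$price, 0.5, na.rm = TRUE)),
    just.view     = as.integer(sum(d$click) == 0)
  )
}
visitor_summary <- do.call(rbind, lapply(split(hits, hits$fullVisitorId), summarise_visitor))
visitor_summary
```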
Note that we use the following packages:
library(ggplot2)
library(lubridate)
library(usmap)
library(tidyverse)
library(reshape)
library(knitr)
library(kableExtra)
library(plotly)
# train = read.csv('/Users/alice/Documents/yr2term1/data
# science/project/Data/train_US_1year_nojson.csv')
load("./dataset/US_1year.Rdata")
train = dat
The original data has missing values at a huge ratio; let's explore it and decide how to deal with them.
train$totalTransactionRevenue[which(is.na(train$totalTransactionRevenue))] = 0
miss = data.frame(Features = colnames(train), Number = sapply(1:ncol(train),
function(x) length(which(is.na(train[, x])))), Percent = round(sapply(1:ncol(train),
function(x) length(which(is.na(train[, x]))))/nrow(train), 2))
miss_bar = miss[(which(miss$Percent != 0)), ]
plot_ly(miss_bar, y = ~reorder(Features, -Percent), x = ~Percent, type = "bar",
text = miss_bar$Percent, orientation = "h") %>% layout(title = "Missing data",
xaxis = list(title = "Percent"), yaxis = list(title = "Features"))
# plot_ly(miss_bar,x=~reorder(Features,-Percent),y=~Percent,type =
# 'bar',text = miss_bar$Percent)%>% layout(title = 'Missing data', yaxis =
# list(title = 'Percent'), xaxis = list(title = 'Features'))
miss %>% mutate_if(is.numeric, function(x) {
cell_spec(x, bold = T, color = spec_color(x, end = 0.9), font_size = spec_font_size(x))
}) %>% kable(escape = F, align = "c", caption = "Missing data summary") %>%
kable_styling(c("striped", "condensed"), full_width = F)
| Features | Number | Percent |
|---|---|---|
| channelGrouping | 0 | 0 |
| date | 0 | 0 |
| fullVisitorId | 0 | 0 |
| visitId | 0 | 0 |
| visitNumber | 0 | 0 |
| visitStartTime | 0 | 0 |
| browser | 1 | 0 |
| operatingSystem | 295 | 0 |
| isMobile | 0 | 0 |
| deviceCategory | 0 | 0 |
| region | 234276 | 0.49 |
| metro | 236260 | 0.5 |
| city | 235617 | 0.5 |
| networkDomain | 269437 | 0.57 |
| referralPath | 348496 | 0.74 |
| campaign | 417281 | 0.88 |
| source | 58 | 0 |
| medium | 180932 | 0.38 |
| isTrueDirect | 280401 | 0.59 |
| keyword | 419443 | 0.89 |
| adContent | 429948 | 0.91 |
| adwordsClickInfo.page | 424814 | 0.9 |
| adwordsClickInfo.slot | 424814 | 0.9 |
| adwordsClickInfo.gclId | 424732 | 0.9 |
| adwordsClickInfo.adNetworkType | 424814 | 0.9 |
| adwordsClickInfo.isVideoAd | 424814 | 0.9 |
| hits1 | 0 | 0 |
| pageviews | 80 | 0 |
| timeOnSite | 197122 | 0.42 |
| sessionQualityDim | 89294 | 0.19 |
| newVisits | 154702 | 0.33 |
| transactions | 0 | 0 |
| transactionRevenue | 39 | 0 |
| totalTransactionRevenue | 0 | 0 |
| bounces | 277342 | 0.59 |
dat = train[, which(miss$Percent < 0.8)]
# test_shiny = train[1:100,-c(32:35)]
# write.csv(test_shiny,file='/Users/alice/Documents/yr2term1/data
# science/homework/google_revenue/test_shiny.csv')
We dropped the features with 80% or more missing values.
Here we explore the marketing channel grouping distribution.
cg = data.frame(table(train$channelGrouping))
percent <- function(x, digits = 2, format = "f", ...) {
paste0(formatC(100 * x, format = format, digits = digits, ...), "%")
}
cg = cg %>% mutate(percent = percent(Freq/sum(Freq)))
cg = cg[order(-cg$Freq), ]
plot_ly(cg, x = ~reorder(Var1, -Freq), y = ~Freq, type = "bar", text = cg$percent) %>%
layout(title = "Channel Grouping", xaxis = list(title = "Channel Grouping"),
yaxis = list(title = "Number"))
Over 40% of the visitors used organic search.
Here we explore the browser distribution.
names(sort(table(train$browser), decreasing = TRUE)[1:10])
## [1] "Chrome" "Safari" "Firefox"
## [4] "Samsung Internet" "Internet Explorer" "Edge"
## [7] "Android Webview" "Safari (in-app)" "Amazon Silk"
## [10] "Opera"
train_brow = train[which(train$browser %in% names(sort(table(train$browser),
decreasing = TRUE)[1:10])), ]
brow = data.frame(table(train_brow$browser))
brow = brow[-which(brow$Freq == 0), ]
brow = brow %>% mutate(percent = percent(Freq/sum(Freq)))
plot_ly(brow, x = ~reorder(Var1, -Freq), y = ~Freq, type = "bar", text = brow$percent) %>%
layout(title = "Top 10 Browser", xaxis = list(title = "Browser"), yaxis = list(title = "Number"))
Most of the visitors used the Chrome browser; the most used browsers are Chrome, Safari, and Firefox.
Here we explore the device category distribution.
dc = data.frame(table(train$deviceCategory))
dc = dc %>% mutate(percent = percent(Freq/sum(Freq)))
plot_ly(dc, x = ~reorder(Var1, -Freq), y = ~Freq, type = "bar", text = dc$percent) %>%
layout(title = "Device Category", xaxis = list(title = "Device Category"),
yaxis = list(title = "Number"))
colnames(train)
## [1] "channelGrouping" "date"
## [3] "fullVisitorId" "visitId"
## [5] "visitNumber" "visitStartTime"
## [7] "browser" "operatingSystem"
## [9] "isMobile" "deviceCategory"
## [11] "region" "metro"
## [13] "city" "networkDomain"
## [15] "referralPath" "campaign"
## [17] "source" "medium"
## [19] "isTrueDirect" "keyword"
## [21] "adContent" "adwordsClickInfo.page"
## [23] "adwordsClickInfo.slot" "adwordsClickInfo.gclId"
## [25] "adwordsClickInfo.adNetworkType" "adwordsClickInfo.isVideoAd"
## [27] "hits1" "pageviews"
## [29] "timeOnSite" "sessionQualityDim"
## [31] "newVisits" "transactions"
## [33] "transactionRevenue" "totalTransactionRevenue"
## [35] "bounces"
The most used device is desktop.
IM = data.frame(table(train$isMobile))
IM = IM %>% mutate(percent = percent(Freq/sum(Freq)))
plot_ly(IM, x = ~reorder(Var1, -Freq), y = ~Freq, type = "bar", text = IM$percent) %>%
layout(title = "isMobile", xaxis = list(title = "isMobile"), yaxis = list(title = "Number"))
Here we explore the device operating system.
train_os = train[which(train$operatingSystem %in% names(sort(table(train$operatingSystem),
decreasing = TRUE)[1:10])), ]
os = data.frame(table(train_os$operatingSystem))
os = os[-which(os$Freq == 0), ]
os = os %>% mutate(percent = percent(Freq/sum(Freq)))
plot_ly(os, x = ~reorder(Var1, -Freq), y = ~Freq, type = "bar", text = os$percent) %>%
layout(title = "Top 10 Operating System", xaxis = list(title = "Operating System"),
yaxis = list(title = "Number"))
Most of the visitors used Mac. The most used operating systems are Macintosh, Windows, iOS, and Android.
Here we explore the network domain distribution.
train_nd = train[which(train$networkDomain %in% names(sort(table(train$networkDomain),
decreasing = TRUE)[1:10])), ]
networkDomain = data.frame(table(train_nd$networkDomain))
networkDomain = networkDomain[-which(networkDomain$Freq == 0), ]
networkDomain = networkDomain %>% mutate(percent = percent(Freq/sum(Freq)))
plot_ly(networkDomain, x = ~reorder(Var1, -Freq), y = ~Freq, type = "bar", text = networkDomain$percent) %>%
layout(title = "Top 10 Network Domain", xaxis = list(title = "Network Domain"),
yaxis = list(title = "Number"))
The most used network domains are comcast, rr, verizon.
Now we check the correlations between these features.
ob = table(train$browser, train$operatingSystem)
ob = ob[, which(colSums(ob) != 0)]
ob = ob[which(rowSums(ob) != 0), ]
ob = melt(ob)
ggplot(ob, aes(Var.1, Var.2)) + geom_tile(aes(fill = value), colour = "white") +
scale_fill_gradient(low = "white", high = "steelblue") + theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1, size = 13)) + geom_text(aes(label = round(value,
1))) + labs(title = "Browser vs. OperatingSystem ", x = "Browser", y = "OperatingSystem")
The most frequent combination is a visit using the Chrome browser on a Mac.
dd = table(train_brow$browser, train_brow$deviceCategory)
dd = dd[, which(colSums(dd) != 0)]
dd = dd[which(rowSums(dd) != 0), ]
dd = melt(dd)
ggplot(dd, aes(Var.2, Var.1)) + geom_tile(aes(fill = value), colour = "white") +
scale_fill_gradient(low = "white", high = "steelblue") + geom_text(aes(label = round(value,
1))) + theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(angle = 45,
hjust = 1, size = 13)) + labs(title = "Browser vs. Device Category ", x = "Device Category",
y = "Browser")
The most frequent combination is a visit using the Chrome browser on a desktop.
mc = table(train_brow$isMobile, train_brow$deviceCategory)
mc = mc[, which(colSums(mc) != 0)]
mc = mc[which(rowSums(mc) != 0), ]
mc = melt(mc)
ggplot(mc, aes(Var.1, Var.2)) + geom_tile(aes(fill = value), colour = "white") +
scale_fill_gradient(low = "white", high = "steelblue") + theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1, size = 13)) + geom_text(aes(label = round(value,
1))) + labs(title = "Device is Mobile vs. Device Category ", x = "Device is Mobile ",
y = "Device Category")
Now we explore geographical attributes of the visit.
city = data.frame(table(train$city))
city = city[city$Freq > mean(city$Freq), ]
city = city %>% mutate(percent = percent(Freq/sum(Freq)))
plot_ly(city, x = ~reorder(Var1, -Freq), y = ~Freq, type = "bar", text = city$percent) %>%
layout(title = "City", xaxis = list(title = "City"), yaxis = list(title = "Number"))
The most frequent city is Mountain View.
reg = data.frame(table(train$region))
reg = reg[reg$Freq > mean(reg$Freq), ]
reg = reg %>% mutate(percent = percent(Freq/sum(Freq)))
plot_ly(reg, x = ~reorder(Var1, -Freq), y = ~Freq, type = "bar", text = reg$percent) %>%
layout(title = "Region", xaxis = list(title = "Region"), yaxis = list(title = "Number"))
The most frequent region is California.
metro = data.frame(table(train$metro))
metro = metro[metro$Freq > mean(metro$Freq), ]
metro = metro %>% mutate(percent = percent(Freq/sum(Freq)))
plot_ly(metro, x = ~reorder(Var1, -Freq), y = ~Freq, type = "bar", text = metro$percent) %>%
layout(title = "Metro", xaxis = list(title = "Metro"), yaxis = list(title = "Number"))
The most frequent metros are San Francisco, New York, and Los Angeles.
# The revenue corresponds to the different states in USA
l <- list(color = toRGB("white"), width = 2)
g <- list(scope = "usa", projection = list(type = "albers usa"), showlakes = TRUE,
lakecolor = toRGB("white"))
map.visit <- dat %>% select(region, transactionRevenue) %>% group_by(region) %>%
summarise(n = round(log(n()), 3), rev = round(log(sum(as.numeric(na.omit(transactionRevenue))) +
1), 3)) %>% rename(full = region)
state.google <- statepop %>% select(fips, abbr, full) %>% left_join(map.visit,
by = "full")
state.google$n[is.na(state.google$n)] = 0
# state.google$hover <- with(state.google, paste('State: ',full, '<br>',
# 'Revenue: ', n))
state.google$hover <- with(state.google, paste(full))
plot_geo(state.google, locationmode = "USA-states") %>% add_trace(z = ~n, text = ~hover,
locations = ~abbr, color = ~n, colors = "Greens") %>% colorbar(title = "log(Number of visit)") %>%
layout(title = "Visit per State (log scale)", geo = g)
The state with the largest number of visits is California, followed by New York.
plot_geo(state.google, locationmode = "USA-states") %>% add_trace(z = ~rev,
text = ~hover, locations = ~abbr, color = ~rev, colors = "Blues") %>% colorbar(title = "log(TransactionRevenue) USD") %>%
layout(title = "Google transaction revenue by State (log scale)", geo = g)
Consistent with the number of visits, the state with the highest transaction revenue is California, followed by New York.
Now we show the distribution of visit per state, considering only visits with non-zero transactions.
map.visit2 <- dat %>% select(region, transactionRevenue) %>% filter(transactionRevenue >
0) %>% group_by(region) %>% summarise(n = round(log(n()), 3), rev = round(log(sum(as.numeric(na.omit(transactionRevenue))) +
1), 3)) %>% rename(full = region)
state.google2 <- statepop %>% select(fips, abbr, full) %>% left_join(map.visit2,
by = "full")
state.google2$n[is.na(state.google2$n)] = 0
state.google2$hover <- with(state.google2, paste(full))
plot_geo(state.google2, locationmode = "USA-states") %>% add_trace(z = ~n, text = ~hover,
locations = ~abbr, color = ~n, colors = "Reds") %>% colorbar(title = "log(Number of visit) ") %>%
layout(title = "Visit per State in non-zero transactions (log scale)", geo = g)
The totals in each state are smaller, and the top two states by number of visits are still California and New York.
dat = dat %>% mutate(date = ymd(date))
date = dat %>% select(date, transactionRevenue) %>% group_by((date)) %>% summarise(n = round(log(n()),
3), rev = round(log(sum(as.numeric(na.omit(transactionRevenue))) + 1), 3))
colnames(date)[1] = "Date"
date$hover <- with(date, paste("Date: ", Date, "<br>", "Visits: ", n, "<br>",
"Revenue: ", rev))
plot_ly(x = ~Date, y = ~rev, data = date, mode = "lines", hoverinfo = "text",
text = date$hover) %>% layout(title = "Revenues per day (log scale)", xaxis = list(title = "Day"),
yaxis = list(title = "Revenue"))
Next we aggregate the transaction-revenue time series by year.
year = dat %>% select(date, transactionRevenue) %>% group_by(year(date)) %>%
summarise(n = round(log(n()), 3), rev = round(log(sum(as.numeric(na.omit(transactionRevenue))) +
1), 3))
colnames(year)[1] = "Year"
year$Year = as.character(year$Year)
ggplot(year, aes(x = Year, y = n)) + geom_bar(stat = "identity", fill = "#FF6666") +
theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(angle = 45,
hjust = 1, size = 13)) + geom_text(aes(label = n), vjust = 0) + labs(title = "Visit per year (log scale)",
x = "Year", y = "Number")
ggplot(year, aes(x = Year, y = rev)) + geom_bar(stat = "identity", fill = "darkblue") +
theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(angle = 45,
hjust = 1, size = 13)) + geom_text(aes(label = rev), vjust = 0) + labs(title = "Revenue per year (log scale)",
x = "Year", y = "Number")
For 2016 and 2017, the number of visits in each year is similar, as is the transaction revenue.
mm <- c("January", "February", "March", "April", "May", "June", "July", "August",
"September", "October", "November", "December")
month = dat %>% select(date, transactionRevenue) %>% group_by(month(date)) %>%
summarise(n = round(log(n()), 3), rev = round(log(sum(as.numeric(na.omit(transactionRevenue))) +
1), 3))
month = data.frame(Month = mm, month)
month$Month <- factor(month$Month, levels = month[["Month"]])
month$hover <- with(month, paste("Month: ", Month, "<br>", "Visits: ", n, "<br>",
"Revenue: ", rev))
plot_ly(x = ~Month, y = ~rev, data = month, type = "scatter", mode = "lines",
hoverinfo = "text", text = month$hover) %>% layout(title = "Revenues per month (log scale)",
xaxis = list(title = "Month"), yaxis = list(title = "Revenue"))
April has the highest revenue.
week = dat %>% select(date, transactionRevenue) %>% group_by(weekdays(date)) %>%
summarise(n = round(log(n()), 3), rev = round(log(sum(as.numeric(na.omit(transactionRevenue))) +
1), 3))
colnames(week)[1] = "Week"
# week$Week <- factor(week$Week, levels = week[['Week']])
week$Week <- factor(week$Week, levels = c("Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday", "Sunday"))
week$hover <- with(week, paste("Week: ", Week, "<br>", "Visits: ", n, "<br>",
"Revenue: ", rev))
order = c(2, 6, 7, 5, 1, 3, 4)
week = week[order, ]
plot_ly(x = ~Week, y = ~rev, data = week, type = "scatter", mode = "lines",
hoverinfo = "text", text = week$hover) %>% layout(title = "Revenues per weekday (log scale)",
xaxis = list(title = "Weekday"), yaxis = list(title = "Revenue"))
Revenue is higher on weekdays than on weekends.
At this stage we use the cleaned original data (which does not contain the hits data from the API) to do the prediction.
## Load packages
library(caret)
library(MASS)
library(glmnet)
library(xgboost)
library(keras)
## Read in data
load("./dataset/US_1year.Rdata")
## Take a quick glimpse at the data
glimpse(dat)
## Observations: 473,480
## Variables: 35
## $ channelGrouping <chr> "Referral", "Direct", "Referral...
## $ date <int> 20171016, 20171016, 20171016, 2...
## $ fullVisitorId <chr> "8934116514970143966", "7992466...
## $ visitId <int> 1508176307, 1508201613, 1508196...
## $ visitNumber <int> 6, 1, 1, 1, 10, 1, 1, 1, 1, 1, ...
## $ visitStartTime <int> 1508176307, 1508201613, 1508196...
## $ browser <chr> "Chrome", "Chrome", "Chrome", "...
## $ operatingSystem <chr> "Chrome OS", "Android", "Macint...
## $ isMobile <lgl> FALSE, TRUE, FALSE, TRUE, TRUE,...
## $ deviceCategory <chr> "desktop", "mobile", "desktop",...
## $ region <chr> "California", NA, "California",...
## $ metro <chr> "San Francisco-Oakland-San Jose...
## $ city <chr> "Cupertino", NA, "San Francisco...
## $ networkDomain <chr> NA, "windjammercable.net", NA, ...
## $ referralPath <chr> "/a/google.com/transportation/m...
## $ campaign <chr> NA, NA, NA, NA, "\"google + red...
## $ source <chr> "sites.google.com", "(direct)",...
## $ medium <chr> "referral", NA, NA, "organic", ...
## $ isTrueDirect <lgl> NA, TRUE, NA, NA, NA, NA, NA, N...
## $ keyword <chr> NA, NA, NA, NA, "(Remarketing/C...
## $ adContent <chr> NA, NA, NA, NA, "Placement Acce...
## $ adwordsClickInfo.page <int> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ adwordsClickInfo.slot <chr> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ adwordsClickInfo.gclId <chr> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ adwordsClickInfo.adNetworkType <chr> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ adwordsClickInfo.isVideoAd <lgl> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ hits1 <int> 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3...
## $ pageviews <int> 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3...
## $ timeOnSite <int> 28, 38, 12, 51, 111, 8, 97, 72,...
## $ sessionQualityDim <int> 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1...
## $ newVisits <int> NA, 1, 1, 1, NA, 1, 1, 1, 1, 1,...
## $ transactions <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ transactionRevenue <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ totalTransactionRevenue <int> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ bounces <int> NA, NA, NA, NA, NA, NA, NA, NA,...
For the prediction task, we first study the correlation between the different variables. We then show the correlation between transactionRevenue and each explanatory variable, displaying the 10 variables with the highest positive correlation and the 10 with the strongest negative correlation.
## Delete some outcome or irregular variables
dat <- dat %>% dplyr::select(-transactions, -totalTransactionRevenue, -adwordsClickInfo.isVideoAd,
-visitStartTime, -referralPath)
## Compute the correlation between transactionRevenue and other variables
y <- dat$transactionRevenue
y[which(is.na(y))] = 0
m <- dat %>% mutate(date = ymd(date)) %>% mutate(year = year(date), month = month(date),
day = day(date), isMobile = ifelse(isMobile, 1L, 0L), isTrueDirect = ifelse(isTrueDirect,
1L, 0L)) %>% mutate_all(funs(ifelse(is.na(.), 0, .))) %>% dplyr::select(-date,
-fullVisitorId, -visitId) %>% mutate_if(is.character, factor) %>% mutate_if(is.factor,
fct_lump, prop = 0.01) %>% model.matrix(~. - 1, .) %>% cor(y) %>% data.table::as.data.table(keep.rownames = TRUE) %>%
set_names("Feature", "rho") %>% arrange(-rho)
m
## Feature
## 1 transactionRevenue
## 2 hits1
## 3 pageviews
## 4 sessionQualityDim
## 5 timeOnSite
## 6 channelGroupingReferral
## 7 operatingSystemMacintosh
## 8 browserChrome
## 9 cityNew York
## 10 regionNew York
## 11 metroNew York NY
## 12 operatingSystemChrome OS
## 13 visitNumber
## 14 cityChicago
## 15 metroChicago IL
## 16 regionIllinois
## 17 keyword0
## 18 citySan Francisco
## 19 adContentGoogle Merchandise Collection
## 20 networkDomaincomcastbusiness.net
## 21 browserFirefox
## 22 channelGroupingDirect
## 23 citySeattle
## 24 networkDomaincomcast.net
## 25 campaignAW - Accessories
## 26 regionCalifornia
## 27 metroLos Angeles CA
## 28 cityLos Angeles
## 29 metroSeattle-Tacoma WA
## 30 regionWashington
## 31 metroSan Francisco-Oakland-San Jose CA
## 32 citySunnyvale
## 33 cityMountain View
## 34 regionOther
## 35 networkDomainverizon.net
## 36 day
## 37 adwordsClickInfo.adNetworkTypeOther
## 38 networkDomainoptonline.net
## 39 networkDomainsbcglobal.net
## 40 metroOther
## 41 cityOther
## 42 adwordsClickInfo.slotOther
## 43 keyword6qEhsCssdK0z36ri
## 44 adwordsClickInfo.slotTop
## 45 adwordsClickInfo.adNetworkTypeGoogle Search
## 46 operatingSystemOther
## 47 networkDomaincox.net
## 48 regionTexas
## 49 mediumOther
## 50 campaignAW - Dynamic Search Ads Whole Site
## 51 keywordOther
## 52 networkDomainrr.com
## 53 browserOther
## 54 channelGroupingPaid Search
## 55 operatingSystemWindows
## 56 citySan Jose
## 57 year
## 58 sourcesites.google.com
## 59 browserInternet Explorer
## 60 channelGroupingOther
## 61 browserEdge
## 62 networkDomainatt.net
## 63 operatingSystemLinux
## 64 browserSafari (in-app)
## 65 sourceanalytics.google.com
## 66 sourceOther
## 67 networkDomaingoogle.com
## 68 browserSamsung Internet
## 69 month
## 70 adContentOther
## 71 campaignOther
## 72 campaign1000557 | GA | US | en | Hybrid | GDN Remarketing
## 73 campaign1000557 | GA | US | en | Hybrid | GDN Text+Banner | AS
## 74 keyword(User vertical targeting)
## 75 networkDomainOther
## 76 deviceCategorytablet
## 77 sourceyoutube.com
## 78 adwordsClickInfo.page
## 79 adwordsClickInfo.gclIdOther
## 80 channelGroupingSocial
## 81 channelGroupingDisplay
## 82 mediumcpc
## 83 adContentGoogle Merchandise Store
## 84 mediumreferral
## 85 adwordsClickInfo.slotRHS
## 86 adwordsClickInfo.adNetworkTypeContent
## 87 mediumorganic
## 88 channelGroupingOrganic Search
## 89 browserSafari
## 90 operatingSystemiOS
## 91 sourcegoogle
## 92 deviceCategorymobile
## 93 isMobile
## 94 isTrueDirect
## 95 newVisits
## 96 bounces
## rho
## 1 1.000000e+00
## 2 2.572123e-01
## 3 2.542981e-01
## 4 2.284239e-01
## 5 2.012814e-01
## 6 4.994584e-02
## 7 4.891075e-02
## 8 3.429162e-02
## 9 2.611505e-02
## 10 2.597448e-02
## 11 2.576536e-02
## 12 2.363181e-02
## 13 2.340780e-02
## 14 1.803973e-02
## 15 1.802497e-02
## 16 1.792326e-02
## 17 1.765122e-02
## 18 5.345817e-03
## 19 3.993315e-03
## 20 3.900645e-03
## 21 3.304409e-03
## 22 3.198217e-03
## 23 3.012843e-03
## 24 2.620727e-03
## 25 2.462379e-03
## 26 2.375522e-03
## 27 2.295679e-03
## 28 1.648290e-03
## 29 1.578737e-03
## 30 1.441443e-03
## 31 1.425087e-03
## 32 7.450421e-04
## 33 7.392058e-04
## 34 5.532255e-04
## 35 4.358264e-04
## 36 4.661613e-05
## 37 -1.524945e-04
## 38 -5.398091e-04
## 39 -5.399487e-04
## 40 -6.035017e-04
## 41 -7.373767e-04
## 42 -7.924281e-04
## 43 -1.268255e-03
## 44 -1.465634e-03
## 45 -1.568651e-03
## 46 -2.388147e-03
## 47 -2.672940e-03
## 48 -2.742028e-03
## 49 -2.776007e-03
## 50 -2.855815e-03
## 51 -3.133678e-03
## 52 -3.844921e-03
## 53 -4.209133e-03
## 54 -4.828764e-03
## 55 -5.373874e-03
## 56 -5.428788e-03
## 57 -5.729795e-03
## 58 -6.015630e-03
## 59 -6.333125e-03
## 60 -6.877139e-03
## 61 -7.201328e-03
## 62 -7.217506e-03
## 63 -7.320697e-03
## 64 -7.841304e-03
## 65 -7.907972e-03
## 66 -8.357512e-03
## 67 -8.521454e-03
## 68 -9.072258e-03
## 69 -9.178907e-03
## 70 -9.467813e-03
## 71 -1.097939e-02
## 72 -1.221217e-02
## 73 -1.360255e-02
## 74 -1.389749e-02
## 75 -1.424669e-02
## 76 -1.453687e-02
## 77 -1.508312e-02
## 78 -1.624744e-02
## 79 -1.651475e-02
## 80 -1.679876e-02
## 81 -1.708103e-02
## 82 -1.788903e-02
## 83 -1.856387e-02
## 84 -1.919846e-02
## 85 -1.929426e-02
## 86 -1.930600e-02
## 87 -2.240342e-02
## 88 -2.335713e-02
## 89 -2.860518e-02
## 90 -3.153411e-02
## 91 -3.261701e-02
## 92 -4.506799e-02
## 93 -4.999592e-02
## 94 -4.999592e-02
## 95 -5.927451e-02
## 96 -6.239677e-02
The result shows that the variables hits1, pageviews, sessionQualityDim, and timeOnSite have correlations significantly larger than the other variables, which matches our intuition.
The result also shows that the variables bounces, newVisits, isTrueDirect, and isMobile have the largest negative correlations with transactionRevenue.
The EDA showed that a large proportion of the data is missing, so our approach is as follows: wherever a value is missing we set it to zero; character columns are converted to factors; and each factor column is flattened into several 0/1 columns.
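As a toy illustration of the flattening step, model.matrix without an intercept turns one factor column into one 0/1 indicator column per level:

```r
# A character column becomes a factor, then model.matrix expands it into
# one indicator column per level (no intercept, so every level appears).
d <- data.frame(channelGrouping = factor(c("Direct", "Referral", "Direct")))
X <- model.matrix(~ . - 1, d)
X
```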
## Data matrix to do the linear regression
dat.mat <- dat %>% mutate(date = ymd(date)) %>% mutate(year = year(date), month = month(date),
day = day(date), isMobile = ifelse(isMobile, 1L, 0L), isTrueDirect = ifelse(isTrueDirect,
1L, 0L)) %>% mutate_all(funs(ifelse(is.na(.), 0, .))) %>% dplyr::select(-date,
-fullVisitorId, -visitId) %>% mutate_if(is.character, factor) %>% mutate_if(is.factor,
fct_lump, prop = 0.01)
# dat.mat <- dat %>% dplyr::select(transactionRevenue, hits1, pageviews,
# sessionQualityDim, timeOnSite, channelGrouping, operatingSystem,
# visitNumber, city, region, metro) %>% mutate_all(funs(ifelse(is.na(.), 0,
# .))) %>% mutate_if(is.character, factor) %>% mutate_if(is.factor,
# fct_lump, prop = 0.01)
train.index <- sample(1:nrow(dat.mat), 0.7 * nrow(dat.mat), replace = FALSE)
dat.train <- dat.mat[train.index, ]
dat.test <- dat.mat[-train.index, ]
To predict transactionRevenue, we apply five different prediction methods: Linear Regression, Logistic + Linear Regression, Deep Learning, Lasso, and XGBoost. The detailed implementations are illustrated below.
For the Linear Regression method, we directly regress the log(transactionRevenue + 1) variable on the other explanatory variables and measure the mean squared error (MSE).
model_lm <- lm(log(1 + transactionRevenue) ~ ., data = as.data.frame(dat.train))
# summary(model_lm)
pred.res <- predict(model_lm, as.data.frame(dat.test %>% dplyr::select(-transactionRevenue)))
pred.res[which(pred.res < 0)] = 0
loss.lm <- sum((log(dat.test$transactionRevenue + 1) - pred.res)^2)/length(pred.res)
print(paste0("The loss for the linear regression is ", round(loss.lm, 4)))
## [1] "The loss for the linear regression is 5.2628"
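The same log-scale MSE is computed for each model below; a small helper (log_mse is a hypothetical name, not used elsewhere in this report) could factor that computation out:

```r
# MSE between log(1 + actual revenue) and a prediction that is already
# on the log(1 + revenue) scale, with negative predictions clamped to 0
# as done for each model in this report.
log_mse <- function(actual_revenue, pred_log) {
  pred_log[pred_log < 0] <- 0
  mean((log1p(actual_revenue) - pred_log)^2)
}

# Sanity check: a perfect prediction has zero loss.
log_mse(c(0, 100), log1p(c(0, 100)))
```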
For the Logistic + Linear Regression method, we develop a two-stage approach. The first stage fits a logistic regression to predict whether the customer will buy anything; if the predicted answer is no, the predicted transactionRevenue for that customer is 0. Otherwise, a linear regression trained only on the customers who did buy is used to predict what the transactionRevenue will be. We then compute the MSE of the combined prediction.
## Fit a logistic regression model to find out which customer buys items
model_logistic <- glm(factor(as.numeric(transactionRevenue > 0)) ~ ., data = dat.train,
family = "binomial")
predict.buy <- predict(model_logistic, dat.test %>% dplyr::select(-transactionRevenue),
type = "response")
## Fit linear regression model for the buy option data
model_lm <- lm(log(1 + transactionRevenue) ~ ., data = as.data.frame(dat.train %>%
filter(transactionRevenue > 0)))
## The indices where predict.buy is larger than 0.5, and the prediction results
index = which(predict.buy > 0.5)
pred.res <- predict(model_lm, dat.test[index, ])
prediction.index <- rep(0, nrow(dat.test))
prediction.index[index] = pred.res
pred.res = prediction.index
pred.res[which(pred.res < 0)] = 0
## Loss for the linear regression
loss.logistic_lm <- sum((log(dat.test$transactionRevenue + 1) - pred.res)^2)/length(pred.res)
print(paste0("The loss for the logistic + linear regression is ", round(loss.logistic_lm,
4)))
## [1] "The loss for the logistic + linear regression is 6.8061"
For Deep Learning, we use R keras to do the training. The model structure is Input -> Dense(32, activation = "relu") -> Dropout(rate = 0.1) -> Dense(16, activation = "relu") -> Dropout(rate = 0.1) -> Dense(1, activation = "linear"). The optimizer used is rmsprop, trained for 100 epochs here. The training process is shown below.
dat.keras.x <- dat.mat %>% dplyr::select(-transactionRevenue) %>% mutate_if(is.factor,
fct_explicit_na) %>% mutate_if(is.numeric, funs(ifelse(is.na(.), 0L, .))) %>%
mutate_if(is.factor, fct_lump, prop = 0.05) %>% model.matrix(~. - 1, .)
dat.keras.y <- dat.mat %>% dplyr::select(transactionRevenue)
## Dataset for training model
dat.keras.train.x <- dat.keras.x[train.index, ]
dat.keras.train.y <- pull(dat.keras.y[train.index, ])
## Data for testing the model
dat.keras.test.x <- dat.keras.x[-train.index, ]
dat.keras.test.y <- pull(dat.keras.y[-train.index, ])
## Train the neural network model
model_nn <- keras_model_sequential()
model_nn %>% layer_dense(units = 32, activation = "relu", input_shape = ncol(dat.keras.x)) %>%
layer_dropout(rate = 0.1) %>% layer_dense(units = 16, activation = "relu") %>%
layer_dropout(rate = 0.1) %>% layer_dense(units = 1, activation = "linear")
model_nn %>% compile(loss = "mean_squared_error", optimizer = optimizer_rmsprop())
history <- model_nn %>% fit(dat.keras.train.x, log(dat.keras.train.y + 1), epochs = 100,
batch_size = 4096, verbose = 1, validation_split = 0.2)
plot(history)
pred.res <- predict(model_nn, dat.keras.test.x)
pred.res[which(pred.res < 0)] = 0
loss.nn <- sum((log(dat.test$transactionRevenue + 1) - pred.res)^2)/length(pred.res)
print(paste0("The loss for the neural network is ", round(loss.nn, 4)))
## [1] "The loss for the neural network is 4.7157"
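All the models in this section are scored with the same metric: mean squared error between the predicted and actual log(1 + transactionRevenue). As a small base-R helper (a sketch; the chunks above inline this formula):

```r
## MSE on the log(1 + revenue) scale, as used for every model in this section
log_mse <- function(actual_revenue, pred_log) {
  mean((log1p(actual_revenue) - pred_log)^2)
}
## Predicting 0 for a zero-revenue session costs nothing; predicting 0 for a
## session with log-revenue 2 contributes (2 - 0)^2 = 4, averaged over sessions
log_mse(c(0, exp(2) - 1), c(0, 0))
## [1] 2
```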
For the penalized regression (labeled lasso below), we use the glmnet package. We first expand the design matrix so that factor columns become 0/1 indicator columns, then scale the explanatory variables. The cv.glmnet function performs the cross-validation, and the lambda with the lowest validation error is used to fit the final model and make the predictions. Note that the code below sets alpha = 0, which is actually ridge regression; a true lasso would use alpha = 1. The relationship between MSE and lambda is shown below.
## The training dataset for x and y
dat.glmnet.x <- dat.mat %>% dplyr::select(-transactionRevenue) %>% # mutate_if(is.factor, fct_explicit_na) %>% mutate_if(is.numeric,
# funs(ifelse(is.na(.), 0L, .))) %>% mutate_if(is.factor, fct_lump, prop =
# 0.05) %>%
model.matrix(~. - 1, .) %>% scale() %>% round(4)
dat.glmnet.y <- dat.mat %>% dplyr::select(transactionRevenue)
## Training dataset for the lasso
dat.glmnet.train.x <- dat.glmnet.x[train.index, ]
dat.glmnet.train.y <- dat.glmnet.y[train.index, ]
## Testing dataset for the lasso
dat.glmnet.test.x <- dat.glmnet.x[-train.index, ]
dat.glmnet.test.y <- dat.glmnet.y[-train.index, ]
## Fitting the penalized regression (note: cv.glmnet is run on the full
## design matrix here; using dat.glmnet.train.x would keep the test rows out)
model_lasso <- cv.glmnet(dat.glmnet.x, (log(pull(dat.glmnet.y) + 1)), family = "gaussian",
    alpha = 0, nlambda = 100, type.measure = "mse")
pred.res <- predict(model_lasso, dat.glmnet.test.x, s = model_lasso$lambda.min,
type = "response")
plot(model_lasso, xvar = "lambda", label = TRUE)
loss.glmnet <- sum((log(dat.test$transactionRevenue + 1) - pred.res)^2)/length(pred.res)
print(paste0("The loss for the lasso regression is ", round(loss.glmnet, 4)))
## [1] "The loss for the lasso regression is 5.3087"
For the XGBoost method, we use the xgboost package with the gblinear booster, running at most 2000 rounds with early stopping after 100 rounds without improvement.
## Data matrix to do the xgboosting
dat.xgb.x <- dat.mat %>% dplyr::select(-transactionRevenue) %>% mutate_if(is.factor,
as.integer) %>% glimpse()
## Observations: 473,480
## Variables: 29
## $ channelGrouping <int> 5, 1, 5, 3, 4, 3, 3, 3, 4, 3, 1...
## $ visitNumber <int> 6, 1, 1, 1, 10, 1, 1, 1, 1, 1, ...
## $ browser <int> 2, 2, 2, 6, 6, 5, 2, 6, 2, 2, 6...
## $ operatingSystem <int> 2, 1, 5, 3, 3, 6, 1, 3, 1, 1, 3...
## $ isMobile <int> 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1...
## $ deviceCategory <int> 1, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2...
## $ region <int> 2, 1, 2, 1, 7, 2, 2, 1, 1, 1, 1...
## $ metro <int> 5, 1, 5, 1, 1, 3, 5, 1, 1, 1, 1...
## $ city <int> 10, 1, 6, 1, 1, 3, 6, 1, 1, 1, ...
## $ networkDomain <int> 1, 11, 1, 11, 2, 11, 1, 11, 3, ...
## $ campaign <int> 1, 1, 1, 1, 6, 1, 1, 1, 5, 1, 1...
## $ source <int> 4, 1, 1, 3, 3, 3, 3, 3, 3, 3, 1...
## $ medium <int> 4, 1, 1, 3, 2, 3, 3, 3, 2, 3, 1...
## $ isTrueDirect <int> 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1...
## $ keyword <int> 3, 3, 3, 3, 5, 3, 3, 3, 4, 3, 3...
## $ adContent <int> 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1...
## $ adwordsClickInfo.page <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0...
## $ adwordsClickInfo.slot <int> 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1...
## $ adwordsClickInfo.gclId <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1...
## $ adwordsClickInfo.adNetworkType <int> 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1...
## $ hits1 <int> 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3...
## $ pageviews <dbl> 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3...
## $ timeOnSite <dbl> 28, 38, 12, 51, 111, 8, 97, 72,...
## $ sessionQualityDim <dbl> 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1...
## $ newVisits <dbl> 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0...
## $ bounces <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ year <dbl> 2017, 2017, 2017, 2017, 2017, 2...
## $ month <dbl> 10, 10, 10, 10, 10, 10, 10, 10,...
## $ day <int> 16, 16, 16, 16, 16, 16, 16, 16,...
dat.xgb.y <- dat.mat %>% dplyr::select(transactionRevenue)
val.index <- sample(1:length(train.index), 0.1 * length(train.index), replace = FALSE)
## The training data matrix to do xg-boosting
dat.xgb.train.x <- xgb.DMatrix(data = data.matrix(dat.xgb.x[train.index[-val.index],
]), label = log(1 + pull(dat.xgb.y[train.index[-val.index], ])))
## The validation data matrix to do xg-boosting
dat.xgb.val.x <- xgb.DMatrix(data = data.matrix(dat.xgb.x[train.index[val.index],
]), label = log(1 + pull(dat.xgb.y[train.index[val.index], ])))
## The testing data matrix to do xg-boosting
dat.xgb.test.x <- xgb.DMatrix(data = data.matrix(dat.xgb.x[-train.index, ]))
## Note: with booster = "gblinear", the tree parameters below (max_depth,
## min_child_weight, gamma, subsample, colsample_bytree) have no effect
p <- list(objective = "reg:linear", booster = "gblinear", eval_metric = "rmse",
    nthread = 4, eta = 0.05, max_depth = 20, min_child_weight = 1, gamma = 0,
    subsample = 0.8, colsample_bytree = 1, nrounds = 2000)
model_xgb <- xgb.train(p, dat.xgb.train.x, p$nrounds, list(val = dat.xgb.val.x),
print_every_n = 100, early_stopping_rounds = 100)
## [1] val-rmse:2.524167
## Will train until val_rmse hasn't improved in 100 rounds.
##
## [101] val-rmse:2.277938
## [201] val-rmse:2.276289
## [301] val-rmse:2.275171
## [401] val-rmse:2.274259
## [501] val-rmse:2.273488
## [601] val-rmse:2.272826
## [701] val-rmse:2.272254
## [801] val-rmse:2.271758
## [901] val-rmse:2.271326
## [1001] val-rmse:2.270949
## [1101] val-rmse:2.270620
## [1201] val-rmse:2.270332
## [1301] val-rmse:2.270080
## [1401] val-rmse:2.269859
## [1501] val-rmse:2.269665
## [1601] val-rmse:2.269495
## [1701] val-rmse:2.269344
## [1801] val-rmse:2.269212
## [1901] val-rmse:2.269095
## [2000] val-rmse:2.268992
pred.res <- predict(model_xgb, dat.xgb.test.x)
loss.xgb <- sum((log(dat.xgb.y[-train.index, ] + 1) - pred.res)^2)/length(pred.res)
print(paste0("The loss for the XGB is ", round(loss.xgb, 4)))
## [1] "The loss for the XGB is 5.3339"
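Since gblinear fits a linear booster, the tree-specific parameters in `p` are ignored. A hypothetical tree-based configuration (a sketch, not tuned for this data) where those parameters would matter could look like:

```r
## Hypothetical gbtree parameter list (a sketch; values are illustrative)
p_tree <- list(objective = "reg:linear", booster = "gbtree", eval_metric = "rmse",
    nthread = 4, eta = 0.05, max_depth = 8, min_child_weight = 1,
    subsample = 0.8, colsample_bytree = 0.7, nrounds = 2000)
## model_xgb_tree <- xgb.train(p_tree, dat.xgb.train.x, p_tree$nrounds,
##     list(val = dat.xgb.val.x), print_every_n = 100, early_stopping_rounds = 100)
```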
Here is the loss summary for all the models:
loss <- c(loss.lm, loss.logistic_lm, loss.nn, loss.glmnet, loss.xgb)
index <- c("lm", "logistic_lm", "neural network", "glmnet", "xgb")
loss.frame <- data.frame(loss, index)
loss.frame %>% ggplot(aes(index, loss)) + geom_bar(stat = "identity")
You will find that Deep Learning outperforms all the other methods by a margin. Linear Regression, Lasso and XGBoost have similar performance, likely because the sample size is much larger than the number of explanatory variables, so variable selection adds little in this problem. Logistic + Linear Regression performs worst, possibly because the logistic regression predicts poorly whether a customer will buy or not.
library(randomForest)
load("./dataset/full.Rdata")
glimpse(full)
## Observations: 374,669
## Variables: 71
## $ fullVisitorId <chr> "0000166374699289385", "...
## $ visitId <chr> "1502294956", "150229495...
## $ hitNumber <dbl> 1, 2, 3, 4, 5, 1, 2, 3, ...
## $ has.transaction <lgl> FALSE, FALSE, FALSE, FAL...
## $ channelGrouping <chr> "Organic Search", "Organ...
## $ date <int> 20170809, 20170809, 2017...
## $ visitNumber <int> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ visitStartTime <int> 1502294956, 1502294956, ...
## $ browser <chr> "Chrome", "Chrome", "Chr...
## $ operatingSystem <chr> "Macintosh", "Macintosh"...
## $ isMobile <lgl> FALSE, FALSE, FALSE, FAL...
## $ deviceCategory <chr> "desktop", "desktop", "d...
## $ region <chr> "Illinois", "Illinois", ...
## $ metro <chr> "Chicago IL", "Chicago I...
## $ city <chr> "Chicago", "Chicago", "C...
## $ networkDomain <chr> NA, NA, NA, NA, NA, "att...
## $ referralPath <chr> NA, NA, NA, NA, NA, NA, ...
## $ campaign <chr> NA, NA, NA, NA, NA, "AW ...
## $ source <chr> "google", "google", "goo...
## $ medium <chr> "organic", "organic", "o...
## $ isTrueDirect <lgl> NA, NA, NA, NA, NA, NA, ...
## $ keyword <chr> NA, NA, NA, NA, NA, "6qE...
## $ adContent <chr> NA, NA, NA, NA, NA, NA, ...
## $ adwordsClickInfo.page <int> NA, NA, NA, NA, NA, 1, 1...
## $ adwordsClickInfo.slot <chr> NA, NA, NA, NA, NA, "Top...
## $ adwordsClickInfo.gclId <chr> NA, NA, NA, NA, NA, "Cjw...
## $ adwordsClickInfo.adNetworkType <chr> NA, NA, NA, NA, NA, "Goo...
## $ adwordsClickInfo.isVideoAd <lgl> NA, NA, NA, NA, NA, FALS...
## $ hits1 <int> 5, 5, 5, 5, 5, 4, 4, 4, ...
## $ pageviews <int> 5, 5, 5, 5, 5, 4, 4, 4, ...
## $ timeOnSite <int> 41, 41, 41, 41, 41, 25, ...
## $ sessionQualityDim <int> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ newVisits <int> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ transactionRevenue <dbl> 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ time <chr> "0", "1779", "8655", "24...
## $ hour <chr> "9", "9", "9", "9", "9",...
## $ minute <chr> "9", "9", "9", "9", "9",...
## $ isEntrance <int> 1, NA, NA, NA, NA, 1, NA...
## $ referer <chr> "https://www.google.com/...
## $ type <chr> "PAGE", "PAGE", "PAGE", ...
## $ dataSource <chr> "web", "web", "web", "we...
## $ isExit <int> NA, NA, NA, NA, 1, NA, N...
## $ page.pagePathLevel1 <chr> "/home", "/home", "/goog...
## $ page.pagePathLevel2 <chr> "", "", "/bags/", "/appa...
## $ page.pagePathLevel3 <chr> "", "", "/backpacks/", "...
## $ page.pagePathLevel4 <chr> "", "", "/home", "", "",...
## $ appInfo.landingScreenName <chr> "www.googlemerchandisest...
## $ appInfo.exitScreenName <chr> "shop.googlemerchandises...
## $ promotionActionInfo.promoIsView <int> NA, 1, NA, NA, NA, NA, 1...
## $ eCommerceAction.action_type <chr> "0", "0", "0", "0", "0",...
## $ eCommerceAction.step <chr> "1", "1", "1", "1", "1",...
## $ social.hasSocialSourceReferral <chr> "No", "No", "No", "No", ...
## $ contentGroup.contentGroup1 <chr> NA, NA, NA, NA, NA, NA, ...
## $ contentGroup.contentGroup2 <chr> NA, NA, "Bags", "Apparel...
## $ contentGroup.contentGroup3 <chr> NA, NA, NA, NA, NA, NA, ...
## $ contentGroup.previousContentGroup1 <chr> "(entrance)", NA, NA, NA...
## $ contentGroup.previousContentGroup2 <chr> "(entrance)", NA, NA, "B...
## $ contentGroup.previousContentGroup3 <chr> "(entrance)", NA, NA, NA...
## $ contentGroup.previousContentGroup4 <chr> "(entrance)", NA, NA, NA...
## $ contentGroup.previousContentGroup5 <chr> "(entrance)", NA, NA, NA...
## $ contentGroup.contentGroupUniqueViews2 <chr> NA, NA, "1", "1", "1", N...
## $ contentGroup.contentGroupUniqueViews1 <chr> NA, NA, NA, NA, NA, NA, ...
## $ contentGroup.contentGroupUniqueViews3 <chr> NA, NA, NA, NA, NA, NA, ...
## $ eventInfo.eventCategory <chr> NA, NA, NA, NA, NA, NA, ...
## $ eventInfo.eventAction <chr> NA, NA, NA, NA, NA, NA, ...
## $ eventInfo.eventLabel <chr> NA, NA, NA, NA, NA, NA, ...
## $ price_q1 <dbl> NA, NA, 69990000, 109900...
## $ price_q2 <dbl> NA, NA, 99990000, 109900...
## $ price_q3 <dbl> NA, NA, 99990000, 189900...
## $ price_q4 <dbl> NA, NA, 99990000, 189900...
## $ price_q5 <dbl> NA, NA, 99990000, 219900...
full = full %>% mutate(ID = paste0(fullVisitorId, visitId))
Let’s pause and take a look at what goods people are buying.
load("./dataset/hits.Rdata")
hits_data = hits_data %>% mutate(ID = paste0(fullVisitorId, visitId))
buy.id = hits_data %>% group_by(ID) %>% summarise(buy = max(!is.na(transaction.transactionId))) %>%
filter(buy > 0)
buy.id = buy.id$ID
buydata = hits_data %>% filter(ID %in% buy.id)
show_buy = head(buydata, n = 50)
product_name = c()
for (i in 1:nrow(buydata)) {
if (!is_empty(buydata$product[[i]])) {
product_name = unique(c(product_name, buydata$product[[i]]$v2ProductName))
}
}
buycount = data.frame(name = product_name, count = 0)
for (i in 1:nrow(buydata)) {
if (!is.na(buydata$eventInfo.eventAction[i]) & buydata$eventInfo.eventAction[i] ==
"Add to Cart") {
for (name in buydata$product[[i]]$v2ProductName) {
buycount$count[buycount$name == name] = buycount$count[buycount$name ==
name] + 1
}
}
}
colormap = function(x) {
col = c()
for (rate in x) {
if (rate > 4)
col = c(col, "red") else if (rate > 2.62)
col = c(col, "orange") else if (rate > 2.6)
col = c(col, "blue") else col = c(col, "black")
}
col
}
buycount %>% mutate(`Ratio (%)` = count/sum(count) * 100) %>% dplyr::select(-count) %>%
arrange(desc(`Ratio (%)`)) %>% head(n = 10) %>% rename(`Good Name` = name) %>%
mutate(`Ratio (%)` = cell_spec(`Ratio (%)`, color = colormap(`Ratio (%)`))) %>%
kable(escape = F) %>% kable_styling(bootstrap_options = "striped", full_width = F,
position = "left")
| Good Name | Ratio (%) |
|---|---|
| Google Sunglasses | 4.11655874190564 |
| Google 22 oz Water Bottle | 2.63644773358002 |
| Google Twill Cap | 2.61332099907493 |
| Google Laptop and Cell Phone Stickers | 2.56706753006475 |
| Kids Take Your Child To Work Day Tee (Pre-Order) | 2.01202590194265 |
| Google Metallic Notebook Set | 1.82701202590194 |
| Maze Pen | 1.78075855689177 |
| Google Tee White | 1.50323774283071 |
| YouTube Tee Black | 1.48011100832562 |
| Windup Android | 1.45698427382054 |
topten = (buycount %>% arrange(desc(count)))["name"]$name[1:10] %>% as.character()
We find that Google Sunglasses is the best seller, followed by the Google 22 oz Water Bottle and the Google Twill Cap.
Let’s try to describe some detail of the transaction behavior:
#### Transactions don't happen when people are viewing content pages
table(Content = full$contentGroup.contentGroup2, Transaction = full$has.transaction,
useNA = "ifany") %>% kable() %>% kable_styling(bootstrap_options = "striped",
full_width = F, position = "left") %>% add_header_above(c(Content = 1, `Had Transaction` = 2))
| Content | Transaction: FALSE | Transaction: TRUE |
|---|---|---|
| Accessories | 23819 | 0 |
| Apparel | 93976 | 0 |
| Bags | 33194 | 0 |
| Brands | 28030 | 0 |
| Drinkware | 16505 | 0 |
| Electronics | 11648 | 0 |
| Nest | 2179 | 0 |
| Office | 22071 | 0 |
| NA | 141735 | 1512 |
#### However, lots of people do exit at the content page
table(full$contentGroup.contentGroup2, isExit = full$isExit, useNA = "ifany") %>%
kable() %>% kable_styling(bootstrap_options = "striped", full_width = F,
position = "left") %>% add_header_above(c(`Content ` = 1, `Exit Page ?` = 2))
| Content | Exit: 1 | Exit: NA |
|---|---|---|
| Accessories | 3358 | 20461 |
| Apparel | 12699 | 81277 |
| Bags | 3617 | 29577 |
| Brands | 8725 | 19305 |
| Drinkware | 2187 | 14318 |
| Electronics | 2023 | 9625 |
| Nest | 1078 | 1101 |
| Office | 2789 | 19282 |
| NA | 34004 | 109243 |
#### Also, there are no transactions on the entrance page
table(full$has.transaction, isEntrance = full$isEntrance, useNA = "ifany") %>%
kable() %>% kable_styling(bootstrap_options = "striped", full_width = F,
position = "left") %>% add_header_above(c(`Transactions?` = 1, `Entrance Page?` = 2))
| Transaction | Entrance: 1 | Entrance: NA |
|---|---|---|
| FALSE | 70480 | 302677 |
| TRUE | 0 | 1512 |
### This suggests that people who just view content after entrance and then
### exit cannot generate revenue
viewer = full %>% group_by(fullVisitorId, visitId) %>% summarise(buy = max(transactionRevenue >
0), just.view = min(!is.na(isEntrance) | page.pagePathLevel1 == "/home" |
!is.na(contentGroup.contentGroup2)))
table(transaction = viewer$buy, just_view = viewer$just.view) %>% kable() %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left") %>%
add_header_above(c(`Transaction?` = 1, `Just Viewing?` = 2))
| Transaction | Just View: 0 | Just View: 1 |
|---|---|---|
| 0 | 13666 | 55366 |
| 1 | 1457 | 0 |
viewer = viewer %>% mutate(ID = paste0(fullVisitorId, visitId)) %>% dplyr::select(-buy)
### This also suggests there should be some tag about transactions. Yes!
### eventInfo.eventAction describes people's actions on the website
unique(full$eventInfo.eventAction)
## [1] NA "Quickview Click" "Add to Cart"
## [4] "Onsite Click" "Product Click" "Remove from Cart"
## [7] "Promotion Click"
addcart = full %>% group_by(fullVisitorId, visitId) %>% summarise(buy = max(transactionRevenue >
0), no_na = max(!is.na(eventInfo.eventAction)), add = sum(!is.na(eventInfo.eventAction) &
eventInfo.eventAction == "Add to Cart"), delete = sum(!is.na(eventInfo.eventAction) &
eventInfo.eventAction == "Remove from Cart"))
table(transaction = addcart$buy, Add_to_cart = addcart$add > 0) %>% kable() %>%
    kable_styling(bootstrap_options = "striped", full_width = F, position = "left") %>%
    add_header_above(c(`Transaction?` = 1, `Add to Cart?` = 2))
| Transaction | Add to Cart: FALSE | Add to Cart: TRUE |
|---|---|---|
| 0 | 65400 | 3632 |
| 1 | 192 | 1265 |
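The Add-to-Cart table above implies a large gap in conversion rates. Using the counts copied from the table (base R):

```r
## Counts copied from the Add-to-Cart table above
added     <- c(no_buy = 3632, buy = 1265)
not_added <- c(no_buy = 65400, buy = 192)
rate_added     <- unname(added["buy"] / sum(added))         # share of add-to-cart sessions that buy
rate_not_added <- unname(not_added["buy"] / sum(not_added)) # share of the rest that buy
round(c(rate_added, rate_not_added), 4)
## [1] 0.2583 0.0029
```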
table(transaction = addcart$buy, Remove_from_cart = addcart$delete > 0) %>%
    kable() %>% kable_styling(bootstrap_options = "striped", full_width = F,
    position = "left") %>% add_header_above(c(`Transaction?` = 1, `Remove from Cart?` = 2))
| Transaction | Remove from Cart: FALSE | Remove from Cart: TRUE |
|---|---|---|
| 0 | 68330 | 702 |
| 1 | 1141 | 316 |
table(transaction = addcart$buy, Actions = addcart$no_na > 0) %>% kable() %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left") %>%
add_header_above(c(`Transaction?` = 1, `Had Actions?` = 2))
| Transaction | Actions: FALSE | Actions: TRUE |
|---|---|---|
| 0 | 56000 | 13032 |
| 1 | 117 | 1340 |
## Is social referral important for transactions? Yes, but referral seems to
## discourage buying. This may just be due to variance, though.
social = full %>% group_by(fullVisitorId, visitId) %>% summarise(buy = max(transactionRevenue >
0), social = max(social.hasSocialSourceReferral == "Yes"))
table(buy = social$buy, social_referal = social$social) %>% kable() %>% kable_styling(bootstrap_options = "striped",
    full_width = F, position = "left") %>% add_header_above(c(`Transaction?` = 1,
    `Social Referral?` = 2))
| Transaction | Social Referral: 0 | Social Referral: 1 |
|---|---|---|
| 0 | 65107 | 3925 |
| 1 | 1451 | 6 |
As a result, we find that transactions do not happen while people are viewing content pages. However, lots of people do exit at a content page, and there are no transactions on the entrance page. All of this suggests that people who just view content after entrance and then exit cannot generate revenue, and the just-viewing table confirms it. It also suggests there should be some tag related to transactions, and indeed eventInfo.eventAction describes a visitor's actions on the website: clicking product pictures, adding goods to the cart, and removing goods from the cart. These are strong predictors (though not quite as strong as just.view). We also find that social referral does have some effect on the result, but unfortunately it is a negative one. That is strange, but it might be because social networks push you into the shop page, so you are less inclined to buy than those who enter the page on their own. It might also simply be variance, since the number of people who bought goods is tiny.
This exploration suggests how we should reduce the external dataset down to these features, all of which are explained in the data section. We also use this summarised data for prediction to see how powerful it is on its own. We use random forest here because this section is meant to show the power of the data: simple bagging avoids overfitting and makes the results more stable (since the number of features is growing and the factors have many levels).
### So we choose the following reduction
hits = full[, 35:72]
hits$time = as.numeric(hits$time)
hits$hour = as.numeric(hits$hour)
hits$price_q1[is.na(hits$price_q1)] = 0
hits$price_q3[is.na(hits$price_q3)] = 0
hits$price_q5[is.na(hits$price_q5)] = 0
hits$contentGroup.contentGroup2[is.na(hits$contentGroup.contentGroup2)] = "other"
hits$eventInfo.eventAction[is.na(hits$eventInfo.eventAction)] = "no action"
reduced = hits %>% group_by(ID) %>% summarise(time_q1 = min(time), time_q2 = quantile(time,
0.333) - time_q1, time_q3 = quantile(time, 0.666) - time_q2, time_q4 = max(time) -
time_q3, hour = median(hour), event_count = sum(type == "EVENT"), promotion_count = sum(!is.na(promotionActionInfo.promoIsView)),
social_count = mean(social.hasSocialSourceReferral == "Yes"), Bags_count = sum(contentGroup.contentGroup2 ==
"Bags"), Apparel_count = sum(contentGroup.contentGroup2 == "Apparel"),
Electronics_count = sum(contentGroup.contentGroup2 == "Electronics"), Brands_count = sum(contentGroup.contentGroup2 ==
"Brands"), Office_count = sum(contentGroup.contentGroup2 == "Office"),
Accessories_count = sum(contentGroup.contentGroup2 == "Accessories"), Drinkware_count = sum(contentGroup.contentGroup2 ==
"Drinkware"), Nest_count = sum(contentGroup.contentGroup2 == "Nest"),
Topten_count = sum(eventInfo.eventLabel %in% topten), Click_count = sum(eventInfo.eventAction %in%
c("Quickview Click", "Onsite Click", "Product Click", "Promotion Click")),
Add_count = sum(eventInfo.eventAction == "Add to Cart"), Remove_count = sum(eventInfo.eventAction ==
"Remove from Cart"), price_q1 = median(price_q1[eventInfo.eventAction ==
"Add to Cart"]), price_q3 = median(price_q3[eventInfo.eventAction ==
"Add to Cart"]), price_q5 = median(price_q5[eventInfo.eventAction ==
"Add to Cart"]))
reduced$price_q1[is.na(reduced$price_q1)] = 0
reduced$price_q3[is.na(reduced$price_q3)] = 0
reduced$price_q5[is.na(reduced$price_q5)] = 0
reduced$price_q1 = log(1 + reduced$price_q1)
reduced$price_q3 = log(1 + reduced$price_q3)
reduced$price_q5 = log(1 + reduced$price_q5)
reduced = merge(reduced, viewer, by = "ID")
### Let's see how can reduce itself perform for predicting
revenue = full %>% group_by(ID) %>% summarise(transactionRevenue = log(1 + max(transactionRevenue)),
hits1 = max(hits1))
from_hits = merge(reduced, revenue, by = "ID")
glimpse(from_hits)
## Observations: 70,489
## Variables: 29
## $ ID <chr> "00001663746992893851502294956", "000030387...
## $ time_q1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ time_q2 <dbl> 4061.832, 4064.931, 1153.179, 0.000, 0.000,...
## $ time_q3 <dbl> 14833.376, 19431.137, 1153.179, 0.000, 0.00...
## $ time_q4 <dbl> 26078.624, 5697.863, 2309.821, 0.000, 0.000...
## $ hour <dbl> 9.0, 18.0, 9.0, 8.0, 17.0, 9.0, 18.0, 6.0, ...
## $ event_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0...
## $ promotion_count <int> 1, 3, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0...
## $ social_count <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Bags_count <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0...
## $ Apparel_count <int> 1, 0, 0, 0, 1, 2, 0, 0, 1, 0, 1, 0, 0, 0, 0...
## $ Electronics_count <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Brands_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0...
## $ Office_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0...
## $ Accessories_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 1, 1...
## $ Drinkware_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0...
## $ Nest_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Topten_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Click_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0...
## $ Add_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Remove_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ price_q1 <dbl> 0.00000, 0.00000, 0.00000, 0.00000, 0.00000...
## $ price_q3 <dbl> 0.00000, 0.00000, 0.00000, 0.00000, 0.00000...
## $ price_q5 <dbl> 0.00000, 0.00000, 0.00000, 0.00000, 0.00000...
## $ fullVisitorId <chr> "0000166374699289385", "00003038793126460",...
## $ visitId <chr> "1502294956", "1503106022", "1511803848", "...
## $ just.view <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ transactionRevenue <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ hits1 <dbl> 5, 4, 2, 1, 1, 4, 1, 1, 3, 1, 14, 2, 2, 1, ...
train.ind = sample(nrow(from_hits), floor(0.7 * nrow(from_hits)), replace = F)
train = from_hits[train.ind, ] %>% dplyr::select(-ID, -time_q1)
test = from_hits[-train.ind, ] %>% dplyr::select(-ID, -time_q1)
rf = randomForest(transactionRevenue ~ ., data = train, ntree = 1000)
rf.pred = predict(rf, newdata = test)
mean((rf.pred - test$transactionRevenue)^2)
## [1] 3.798484
### And you can see this is already a little better
Using the summarised hits data alone already yields a lower test MSE than any of the models trained on the original features above.
Now let’s combine the original data with the features extracted from the hits data, and finalize our model.
original = full %>% group_by(ID) %>% filter(hitNumber == max(hitNumber))
original = original[, 1:34] %>% dplyr::select(-hitNumber) %>% mutate(ID = paste0(fullVisitorId,
visitId)) %>% dplyr::select(-fullVisitorId, -visitId)
original$transactionRevenue = log(1 + original$transactionRevenue)
final.data = merge(original, reduced, by = "ID") %>% dplyr::select(-ID, -fullVisitorId,
-visitId, -has.transaction)
### then apply the same preprocessing as in the previous prediction
final <- final.data %>% mutate(date = ymd(date)) %>% mutate(year = year(date),
    month = month(date), day = day(date), isMobile = ifelse(isMobile, 1L, 0L),
    isTrueDirect = ifelse(isTrueDirect, 1L, 0L)) %>% mutate_all(funs(ifelse(is.na(.),
    0, .))) %>% dplyr::select(-date) %>% mutate_if(is.character, factor) %>%
    mutate_if(is.factor, fct_lump, prop = 0.01)
glimpse(final)
## Observations: 70,489
## Variables: 56
## $ channelGrouping <fct> Organic Search, Paid Search, Or...
## $ visitNumber <int> 1, 1, 2, 1, 1, 1, 1, 2, 1, 2, 1...
## $ visitStartTime <int> 1502294956, 1503106022, 1511803...
## $ browser <fct> Chrome, Chrome, Chrome, Chrome,...
## $ operatingSystem <fct> Macintosh, Android, Android, Li...
## $ isMobile <int> 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1...
## $ deviceCategory <fct> desktop, mobile, mobile, deskto...
## $ region <fct> Illinois, New York, New York, N...
## $ metro <fct> Chicago IL, New York NY, New Yo...
## $ city <fct> Chicago, New York, New York, Ne...
## $ networkDomain <fct> 0, att.net, att.net, 0, 0, 0, O...
## $ referralPath <fct> 0, 0, 0, Other, 0, 0, 0, 0, 0, ...
## $ campaign <fct> 0, AW - Dynamic Search Ads Whol...
## $ source <fct> google, google, google, sites.g...
## $ medium <fct> organic, cpc, organic, referral...
## $ isTrueDirect <int> 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1...
## $ keyword <fct> 0, 6qEhsCssdK0z36ri, 0, 0, 0, 0...
## $ adContent <fct> 0, 0, 0, 0, 0, 0, Google Mercha...
## $ adwordsClickInfo.page <dbl> 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0...
## $ adwordsClickInfo.slot <fct> 0, Top, 0, 0, 0, 0, RHS, RHS, 0...
## $ adwordsClickInfo.gclId <fct> 0, Other, 0, 0, 0, 0, Other, Ot...
## $ adwordsClickInfo.adNetworkType <fct> 0, Google Search, 0, 0, 0, 0, C...
## $ adwordsClickInfo.isVideoAd <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ hits1 <int> 5, 4, 2, 1, 1, 4, 1, 1, 3, 1, 1...
## $ pageviews <dbl> 5, 4, 2, 1, 1, 4, 1, 1, 3, 1, 1...
## $ timeOnSite <dbl> 41, 25, 3, 0, 0, 40, 0, 0, 95, ...
## $ sessionQualityDim <dbl> 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1...
## $ newVisits <dbl> 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1...
## $ transactionRevenue <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ time_q1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ time_q2 <dbl> 4061.832, 4064.931, 1153.179, 0...
## $ time_q3 <dbl> 14833.376, 19431.137, 1153.179,...
## $ time_q4 <dbl> 26078.624, 5697.863, 2309.821, ...
## $ hour <dbl> 9.0, 18.0, 9.0, 8.0, 17.0, 9.0,...
## $ event_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1...
## $ promotion_count <int> 1, 3, 1, 0, 0, 1, 1, 1, 1, 0, 0...
## $ social_count <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Bags_count <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3...
## $ Apparel_count <int> 1, 0, 0, 0, 1, 2, 0, 0, 1, 0, 1...
## $ Electronics_count <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Brands_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0...
## $ Office_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1...
## $ Accessories_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2...
## $ Drinkware_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1...
## $ Nest_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Topten_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Click_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1...
## $ Add_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Remove_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ price_q1 <dbl> 0.00000, 0.00000, 0.00000, 0.00...
## $ price_q3 <dbl> 0.00000, 0.00000, 0.00000, 0.00...
## $ price_q5 <dbl> 0.00000, 0.00000, 0.00000, 0.00...
## $ just.view <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ year <dbl> 2017, 2017, 2017, 2017, 2017, 2...
## $ month <dbl> 8, 8, 11, 7, 5, 12, 9, 9, 6, 6,...
## $ day <int> 9, 18, 27, 17, 17, 8, 23, 25, 2...
Here goes the final model. We use just.view to rule out the pure viewers (about 3/4 of sessions), hoping this increases the power of the models (it might not help much for tree-based models). We again use randomForest, for the same reason as above. We also plot the feature importances.
train.ind = sample(nrow(final), floor(0.7 * nrow(final)), replace = F)
train = final[train.ind, ] %>% dplyr::select(-time_q1)
test = final[-train.ind, ] %>% dplyr::select(-time_q1)
table(transaction = final$transactionRevenue > 0, just_view = final$just.view)
## just_view
## transaction 0 1
## FALSE 13666 55366
## TRUE 1457 0
train.sub = train %>% filter(just.view == 0)
rf.sub = randomForest(transactionRevenue ~ ., data = train.sub, ntree = 1000)
predict.revenue = rep(NA, nrow(test))
for (i in 1:nrow(test)) {
if (test$just.view[i] == 1) {
predict.revenue[i] = 0
} else {
predict.revenue[i] = predict(rf.sub, test[i, ])
}
}
mean((predict.revenue - test$transactionRevenue)^2)
## [1] 3.566199
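Comparing with the hits-only random forest above (test MSE 3.798), the combined model's 3.566 is a modest relative improvement:

```r
## Relative improvement of the final model over the hits-only random forest
mse_hits  <- 3.798484
mse_final <- 3.566199
round((mse_hits - mse_final) / mse_hits * 100, 1)  # percent improvement
## [1] 6.1
```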
data.frame(importance = rf.sub$importance[, 1], feature = rownames(rf.sub$importance)) %>%
arrange(desc(importance)) %>% ggplot(aes(x = reorder(feature, desc(importance)),
y = importance)) + geom_bar(stat = "identity", fill = "skyblue3") + xlab("Features") +
ylab("Importance") + ggtitle("Feature Importance Plot") + theme(axis.text.x = element_text(angle = 90,
hjust = 1))
You can see the result is good, but similar to using the summarised hits data alone; perhaps the original features add little power compared to this one.
We learn that: 1. Representing the data is always an open question; you need to explore the data fully before you can find the right way to do it. 2. Choosing a model is not crucial at the outset: we can train them all and compare. But models typically give results only as reasonable as your initial exploration, so a deliberate model-selection process increases efficiency. 3. For ill-conditioned problems, fitting another model to make the problem more regular may not improve the result (sequential modeling can introduce too much variance). But using exploration results from the data to make the problem less ill-conditioned can help, which suggests exploring the data fully before modeling rather than using modeling itself to assist modeling.
We validate our answers on a randomly held-out test dataset.
The limitation is that the dataset encodes a lot of web behavior, and some of the features accumulate over the course of a visit, so they may not be observable at the beginning of a session. As a pure prediction model, its useful cases may therefore be limited.